Is the p-value a good measure of evidence? An asymptotic consistency criterion 
Abstract

What are the criteria that a measure of statistical evidence should satisfy? It is argued that a measure of evidence
should be consistent. Consistency is an asymptotic criterion: the probability that a hypothesis H is indeed not true,
given that a measure of evidence strongly testifies against H in the data, should go to one as more and more data
become available. The p-value is not consistent, while the ratio of likelihoods is.
1. Introduction

The p-value is commonly used as a measure of evidence in data $X_1^n$ against a hypothesis $H_1$: the smaller the
p-value, the stronger the evidence against $H_1$ in the data. Recall that the p-value is the smallest level at which a
test $T(X_1^n)$ rejects $H_1$. According to the typical calibration [18], a p-value smaller than 0.01 suggests very
strong evidence against $H_1$.

Unlike the p-value, which measures evidence against a single hypothesis, the ratio of likelihoods$^a$ measures
evidence in data for a simple hypothesis $H_1$, relative to a simple hypothesis $H_2$. For a parametric model
$f_X(x \mid \theta)$, the ratio of likelihoods $r_{12} = f(X_1^n \mid H_1)/f(X_1^n \mid H_2)$ measures evidence for
$H_1$ relative to $H_2$ in data $X_1^n$. A value of $r_{12}$ above a certain threshold $k > 1$ is taken as evidence in
favor of $H_1$ and against $H_2$. Values of $k$ around 30 are suggested for a threshold above which the evidence is
considered very strong (cf. [14], |Q]]).

Statistics abounds in criteria for assessing the quality of estimators, tests, forecasting rules, and classification
algorithms, but besides the likelihood principle discussions (cf. [2]), it seems to be almost silent on what criteria a
good measure of evidence should satisfy. Schervish, in a notable exception$^b$ [16], considers a requirement of
coherence, borrowed from the multiple comparisons theory [10]. If $H : \theta \in \Theta$ implies
$H' : \theta \in \Theta'$ (i.e., $\Theta \subset \Theta'$), then a coherent measure of evidence gives at least as
strong evidence to $H'$ as it gives to $H$. The p-value is not coherent; cf. [16]. In this note, an asymptotic
criterion of consistency is introduced, and it is demonstrated that the p-value is not consistent, while the ratio of
likelihoods satisfies the consistency requirement.

2. Measure of evidence 

To set a formal framework, let $X \in \mathbb{R}^K$ be a random variable with the probability density (or mass)
function $f_X(x \mid \theta)$, parametrized by $\theta \in \Theta \subset \mathbb{R}^L$, and such that if
$\theta \neq \theta'$ then $f_X(\cdot \mid \theta) \neq f_X(\cdot \mid \theta')$. Let $\Theta_1, \Theta_2$ form a
partition of $\Theta$, and associate $\Theta_j$ with the hypothesis $H_j$, $j = 1, 2$. Let
$X_1^n = X_1, \dots, X_n \sim f_X(x \mid \theta)$ be a random sample from $f_X(x \mid \theta)$. A measure of evidence
$e(H_1, H_2, X_1^n)$, in data $X_1^n$, for the hypothesis $H_1 : X_1^n \sim f_X(x \mid \theta)$ where
$\theta \in \Theta_1$, relative to $H_2 : X_1^n \sim f_X(x \mid \theta)$ where $\theta \in \Theta_2$, is a mapping
$e(H_1, H_2, X_1^n) : \Theta_1 \times \Theta_2 \times (\mathbb{R}^K)^n \to \mathbb{R}$. It usually goes with a
calibration that partitions the values of $e(\cdot)$ into intervals, or categories. In what follows, the interest
will concentrate on the category $S$ of the most extreme values of the evidence measure $e(\cdot)$, which correspond
to the strongest evidence. Finally, the measure of evidence against a hypothesis $H_1$, relative to $H_2$, in data
$X_1^n$, will be denoted $e(\neg H_1, H_2, X_1^n)$.

3. Consistency requirement 

In [17], Sellke, Bayarri, and Berger stress that in applications of an evidence measure, data sets may come from
either $H_1$ or $H_2$. The authors illustrate this important point by an example of testing drugs
$D_1, D_2, D_3, \dots$ for an illness, in a series of independent experiments. The measure of evidence applied to the
data set from the $i$-th experiment is used to differentiate between the hypothesis $H_1$ that the drug $D_i$ has a
negligible effect and the alternative $H_2$ that the drug $D_i$ has a non-negligible effect. Some drugs have
negligible effects, others have non-negligible ones. In other words, some experimental data $X_1^n$ come from $H_1$,
other data sets come from $H_2$. This key aspect of applications of the evidence measure can be captured by the
following two-level sampling mechanism:

1. First, $\theta$ is drawn from a pdf (or pmf) $p(\theta)$.

2. Given $\theta$, a random sample $X_1^n$ is drawn from $f_X(x \mid \theta)$.
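
The two-level mechanism above can be sketched in a few lines of Python. This is only a minimal illustration under
assumptions that the text does not make at this point: a two-point prior that puts mass `w` on `theta1` (standing for
$H_1$) and mass `1 - w` on `theta1 + delta` (standing for $H_2$), and unit-variance Gaussian data. The function name
`draw_dataset` and all parameter names are hypothetical.

```python
import random


def draw_dataset(n, w, theta1, delta, rng):
    """Two-level sampling (illustrative assumptions: two-point prior, N(theta, 1) data)."""
    # Level 1: draw theta from the prior; with probability w the data set comes from H1.
    from_h1 = rng.random() < w
    theta = theta1 if from_h1 else theta1 + delta
    # Level 2: given theta, draw a random sample of size n.
    sample = [rng.gauss(theta, 1.0) for _ in range(n)]
    return from_h1, sample


rng = random.Random(0)
from_h1, sample = draw_dataset(n=5, w=0.5, theta1=0.0, delta=1.0, rng=rng)
print(from_h1, sample)
```

Repeating the call many times yields a mixture of data sets, a fraction of roughly `w` of which come from $H_1$.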

As the sample size $n$ increases, it should hold, informally put, that among the data sets which, according to the
measure of evidence, strongly testify against $H_1$, the relative number of those which in fact come from $H_1$
should go to zero. This motivates the following requirement of consistency: We say that a measure of evidence
$e(\neg H_1, H_2, X_1^n)$ against $H_1$, relative to $H_2$, is consistent if

$$\lim_{n \to \infty} \Pr(H_1 \mid e(\neg H_1, H_2, X_1^n) \in S) = 0.$$

The probability that $\theta$ is in $\Theta_1$, given that the measure of evidence $e(\neg H_1, H_2, X_1^n)$ strongly
testifies against $H_1$, relative to $H_2$, should go to zero as the sample size $n$ grows without bound.



4. Is the p-value consistent? 

The p-value is $\pi = \inf\{\alpha : T(X_1^n) \in R_\alpha\}$, where $T$ is a test statistic, $\alpha$ is the size of
the test, and $R_\alpha$ is the rejection region for $H_1$. In this section it is assumed that $X$ is a continuous
random variable and the test statistic $T$ is such that it rejects $H_1$ when the observed value $t$ of $T$ is large.
Then the p-value is $\pi = \sup_{\Theta_1} \Pr(T > t \mid \theta)$. The p-value $\pi(\neg H_1, \cdot, X_1^n)$ as a
measure of evidence against $H_1$ does not take $H_2$ into account. Let $S = [0, \alpha_S)$ be the interval of values
that indicate very strong evidence against $H_1$.

Before addressing the question of consistency of the p-value in general, consider an illustrative example of a
Gaussian random variable $X$ with variance $\sigma^2 = 1$, and let $\Theta_1 = \{\theta_1\}$,
$\Theta_2 = \{\theta_1 + \delta\}$, $\delta > 0$. Let $w = p(\theta_1)$, $w \in (0, 1)$. And let
$T(X_1^n) = \sqrt{n}(\bar{x} - \theta_1)$ be the test statistic, and
$R_\alpha = \{X_1^n : T(X_1^n) > z_{1-\alpha}\}$ be the rejection region, with $z_{1-\alpha}$ denoting the
$1 - \alpha$ quantile of the standard normal distribution.

Under $H_1$, the p-value is a uniform random variable, so $\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \Theta_1) = \alpha_S$.
Under $H_2$, the power of the test is
$\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \Theta_2) = 1 - \Phi(z_{1-\alpha_S} - \sqrt{n}\,\delta)$, where
$\Phi(\cdot)$ is the distribution function of the standard normal random variable. Note that
$\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \Theta_2)$ converges to 1, for $\delta > 0$. Taken together,
$\lim_{n \to \infty} \Pr(H_1 \mid \pi(\neg H_1, \cdot, X_1^n) \in S) = \frac{w \alpha_S}{1 - w(1 - \alpha_S)}$. Thus,
in this simple example, the p-value is not a consistent measure of evidence against $H_1$.
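
The Gaussian example can be checked numerically with a small Monte Carlo sketch. It is an illustration rather than
part of the argument: it draws the test statistic $T$ directly from its sampling distribution ($N(0, 1)$ under $H_1$,
$N(\sqrt{n}\,\delta, 1)$ under $H_2$) instead of simulating whole samples, and the settings `n=100`, `w=0.9`,
`delta=1.0`, `alpha_s=0.01`, as well as the function name, are assumptions chosen for the demonstration.

```python
import math
import random
from statistics import NormalDist


def estimate_pr_h1_given_small_p(n, w, delta, alpha_s, trials, seed=0):
    """Monte Carlo estimate of Pr(H1 | p-value < alpha_s) in the Gaussian example."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    strong = 0          # data sets whose p-value falls in S = [0, alpha_s)
    strong_from_h1 = 0  # ... of which were generated under H1
    for _ in range(trials):
        from_h1 = rng.random() < w                       # level 1: draw the hypothesis
        shift = 0.0 if from_h1 else math.sqrt(n) * delta
        t = rng.gauss(shift, 1.0)                        # level 2: T = sqrt(n) * (mean - theta1)
        p_value = 1.0 - std_normal.cdf(t)                # one-sided p-value
        if p_value < alpha_s:
            strong += 1
            strong_from_h1 += from_h1
    return strong_from_h1 / strong


estimate = estimate_pr_h1_given_small_p(
    n=100, w=0.9, delta=1.0, alpha_s=0.01, trials=100_000
)
print(estimate)
```

With these settings the power is practically 1, so the estimate should lie close to
$w\alpha_S/(1 - w(1 - \alpha_S)) \approx 0.083$ rather than near zero.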

Following the reasoning in the above example, it can be demonstrated that the p-value is inconsistent$^d$.

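Proposition 1. Let $\Theta_1, \Theta_2$ form a partition of $\Theta$. Let $p(\theta)$ be such that
$w = \int_{\Theta_1} p(\theta)$ satisfies $w \in (0, 1)$. And let $T$, $R_\alpha$ be such that
$\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \Theta_2) \to 1$ as $n \to \infty$ (i.e., for $\theta \in \Theta_2$, the
power of the test $T$ converges to 1). Then it holds that

$$\lim_{n \to \infty} \Pr(H_1 \mid \pi(\neg H_1, \cdot, X_1^n) \in S) = \frac{w \alpha_S}{1 - w(1 - \alpha_S)}. \tag{1}$$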

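Proof. Under $H_1$, the p-value is uniformly distributed, so that
$\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \theta) = \alpha_S$ for $\theta \in \Theta_1$. Thus,
$\int_{\Theta_1} \Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \theta)\, p(\theta) = \alpha_S w$. Next, under the
assumption that the power of the test $T$ goes to 1 as $n \to \infty$, the probability
$\int_{\Theta_2} \Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \theta)\, p(\theta) \to 1 - w$. Taken together, by Bayes'
theorem the limit is $\alpha_S w / (\alpha_S w + 1 - w)$, which proves the Proposition. □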
Since the right-hand side expression in (1) is positive, the p-value is not a consistent measure of evidence. The
limit of the probability becomes zero only in the extreme, uninteresting case of $w = 0$, i.e., when no $X_1^n$ comes
from $H_1$. For the typical value of $\alpha_S = 0.01$ and $w = 1/2$, the limit value of the probability is
$\alpha_S/(1 + \alpha_S) = 0.0099$. For $w = 0.9$, the probability is 0.0826. For $w = 0.999$, the probability is
0.9090, and it converges to 1 as $w \to 1$. The greater the relative presence of data sets from $H_1$, the higher the
asymptotic probability that the data come from $H_1$ when the p-value strongly testifies against $H_1$.
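
The quoted limit values can be reproduced directly from the limit $w\alpha_S/(1 - w(1 - \alpha_S))$. The short sketch
below does only this arithmetic; the function name `limit_probability` is a hypothetical label.

```python
def limit_probability(w, alpha_s):
    """Asymptotic Pr(H1 | p-value in S) when the power of the test tends to 1."""
    return w * alpha_s / (1.0 - w * (1.0 - alpha_s))


for w in (0.5, 0.9, 0.999):
    print(f"w = {w}: {limit_probability(w, 0.01):.4f}")
# w = 0.5: 0.0099
# w = 0.9: 0.0826
# w = 0.999: 0.9090
```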

5. Is the ratio of likelihoods consistent? 

For point sets $\Theta_1, \Theta_2$, the ratio of likelihoods $r_{12}$ of $H_1$ relative to $H_2$ is
$r_{12} = f_1/f_2$, where $f_j = f_{X_1^n}(x_1^n \mid \theta_j)$ for $j = 1, 2$. The ratio $r_{12}$ measures the
evidence in favor of $H_1$ (and against $H_2$) in data $X_1^n$. The larger the $r_{12}$, the stronger the evidence in
favor of $H_1$ (and against $H_2$), so that $S = [k_S, \infty)$, $k_S > 1$.

First, consider the ratio of likelihoods $r_{21}$ in the example described above. Clearly,
$\Pr(r_{21}(\neg H_1, H_2, X_1^n) \in S \mid \Theta_1) = 1 - \Phi\left(\frac{\log k_S}{\delta\sqrt{n}} + \frac{\sqrt{n}\,\delta}{2}\right)$,
which, under the assumption $\delta > 0$, converges to 0 as $n \to \infty$. And
$\Pr(r_{21}(\neg H_1, H_2, X_1^n) \in S \mid \Theta_2) = 1 - \Phi\left(\frac{\log k_S}{\delta\sqrt{n}} - \frac{\sqrt{n}\,\delta}{2}\right)$,
which, under the assumption $\delta > 0$, converges to 1 as $n \to \infty$. Thus,
$\lim_{n \to \infty} \Pr(H_1 \mid r_{21}(\neg H_1, H_2, X_1^n) \in S) = 0$. Hence, the ratio of likelihoods is a
consistent measure of evidence in this example.
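
The two closed-form probabilities of the example can be combined through Bayes' theorem to watch
$\Pr(H_1 \mid r_{21} \in S)$ shrink as $n$ grows. The sketch below is illustrative only: the values `w=0.9`,
`delta=0.5`, and `k_s=30` are assumptions made for the demonstration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pr_h1_given_strong_rl(n, w, delta, k_s):
    """Pr(H1 | r21 >= k_s) in the Gaussian example, from the closed-form probabilities."""
    a = math.log(k_s) / (delta * math.sqrt(n))
    b = math.sqrt(n) * delta / 2.0
    # Upper tails 1 - Phi(x) are computed as Phi(-x) for numerical accuracy.
    pr_strong_h1 = STD_NORMAL.cdf(-(a + b))  # Pr(r21 in S | Theta_1), tends to 0
    pr_strong_h2 = STD_NORMAL.cdf(-(a - b))  # Pr(r21 in S | Theta_2), tends to 1
    return w * pr_strong_h1 / (w * pr_strong_h1 + (1.0 - w) * pr_strong_h2)


posteriors = [pr_h1_given_strong_rl(n, w=0.9, delta=0.5, k_s=30.0) for n in (10, 100, 1000)]
print(posteriors)
```

Even with 90% of the data sets coming from $H_1$, the conditional probability falls steadily toward zero, in contrast
with the p-value, whose corresponding limit stays positive.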

And the consistency is not accidental, as stated in the following Proposition. 

Proposition 2. For point sets $\Theta_1, \Theta_2$, and $p(\Theta_1) \in (0, 1)$, the ratio of likelihoods
$r_{21}(\neg H_1, H_2, X_1^n)$ is a consistent measure of evidence, i.e.,

$$\lim_{n \to \infty} \Pr(H_1 \mid r_{21}(\neg H_1, H_2, X_1^n) \in S) = 0.$$

Proof. The claim follows from the Law of Large Numbers (LLN), applied to $\frac{1}{n} \log(f_2/f_1) \mid \Theta_j$,
and the fact that the Kullback-Leibler divergence is positive for distinct distributions. □
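
The steps behind this short proof can be spelled out as follows. This is a sketch under added assumptions, not the
author's own derivation: the notation
$\mathrm{KL}(\theta_a \,\|\, \theta_b) = E_{\theta_a} \log \frac{f_X(X \mid \theta_a)}{f_X(X \mid \theta_b)}$ is
introduced here, both divergences are assumed finite, and $w = p(\Theta_1)$.

```latex
\begin{align*}
\frac{1}{n}\log\frac{f_2}{f_1}
  = \frac{1}{n}\sum_{i=1}^{n}\log\frac{f_X(X_i \mid \theta_2)}{f_X(X_i \mid \theta_1)}
  &\longrightarrow -\mathrm{KL}(\theta_1 \,\|\, \theta_2) < 0
  && \text{in probability, under } \theta_1, \\
\frac{1}{n}\log\frac{f_2}{f_1}
  &\longrightarrow \mathrm{KL}(\theta_2 \,\|\, \theta_1) > 0
  && \text{in probability, under } \theta_2. \\
\intertext{Since $r_{21} \ge k_S$ is equivalent to $\frac{1}{n}\log\frac{f_2}{f_1} \ge \frac{\log k_S}{n}$ and $\frac{\log k_S}{n} \to 0$,}
a_n := \Pr(r_{21} \in S \mid \theta_1) &\longrightarrow 0,
\qquad
b_n := \Pr(r_{21} \in S \mid \theta_2) \longrightarrow 1, \\
\Pr(H_1 \mid r_{21} \in S)
  = \frac{w\,a_n}{w\,a_n + (1 - w)\,b_n}
  &\longrightarrow 0 .
\end{align*}
```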

Recently, Bickel [3] proposed an extension of the ratio of likelihoods (see also ifioll, [19]) to the case of general
$\Theta_1, \Theta_2$: $r^e_{12} = \sup_{\Theta_1} f(X_1^n \mid \theta)/\sup_{\Theta_2} f(X_1^n \mid \theta)$, and
suggested its use as a measure of evidence. The extended ratio of likelihoods reduces to the ratio of likelihoods
when $\Theta_1, \Theta_2$ are point sets. Under additional assumptions, $r^e_{21}$ is a consistent measure of
evidence. Before stating the result, recall that the maximum likelihood (ML) estimator $\hat{\theta}(\tilde{\Theta})$
of $\theta$, restricted to $\tilde{\Theta} \subset \Theta$, is
$\hat{\theta}(\tilde{\Theta}) = \arg\sup_{\theta \in \tilde{\Theta}} f_{X_1^n}(x_1^n \mid \theta)$.

Proposition 3. Let $f_X(x \mid \theta)$ and $\Theta_1, \Theta_2$ be such that the maximum likelihood estimators
$\hat{\theta}_j(\Theta_j)$, restricted to $\Theta_j$, are consistent estimators of $\theta$, $j = 1, 2$. And let the
maximum likelihood estimators $\hat{\theta}_i(\Theta_j)$, restricted to $\Theta_i$, converge in probability to some
finite $\theta_j$, $i, j \in \{1, 2\}$, $i \neq j$. Let $p(\theta)$ be such that
$\int_{\Theta_1} p(\theta) \in (0, 1)$. Then the extended ratio of likelihoods $r^e_{21}(\neg H_1, H_2, X_1^n)$ is a
consistent measure of evidence against $H_1$, relative to $H_2$.

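Proof. Under the assumed consistency and convergence of the constrained ML estimators, the claim follows from the LLN
and the positivity of the Kullback-Leibler divergence between two different distributions, applied to the probability
$\Pr(r^e_{21} > k_S \mid \theta)$ in the upper bound

$$\frac{\left[\sup_{\Theta_1} \Pr(r^e_{21} > k_S \mid \theta)\right] \int_{\Theta_1} p(\theta)}{\int_{\Theta} \Pr(r^e_{21} > k_S \mid \theta)\, p(\theta)}$$

and the lower bound

$$\frac{\left[\inf_{\Theta_1} \Pr(r^e_{21} > k_S \mid \theta)\right] \int_{\Theta_1} p(\theta)}{\int_{\Theta} \Pr(r^e_{21} > k_S \mid \theta)\, p(\theta)}$$

of $\Pr(\theta \in \Theta_1 \mid r^e_{21}(\neg H_1, H_2, X_1^n) \in S)$. □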

6. Is the Bayes factor consistent? 

It is open to debate whether a measure of evidence can depend on prior information. Bayesians usually measure
evidence for $H_1$ relative to $H_2$ by the Bayes factor
$b_{12} = \int_{H_1} f(X_1^n \mid \theta)\, q(\theta)\, d\theta / \int_{H_2} f(X_1^n \mid \theta)\, q(\theta)\, d\theta$,
where $q(\cdot)$ is the prior distribution. A Bayes factor above 150 is usually considered [9] very strong evidence
for $H_1$. However, Lavine and Schervish [11] note that the Bayes factor does not satisfy the coherence requirement,
while the posterior odds is coherent. Both the Bayes factor $b_{21}$ and the posterior odds
$p_{21}(H_2, H_1, X_1^n) = b_{21}\, q(\Theta_2)/q(\Theta_1)$ are consistent measures of evidence against $H_1$,
relative to $H_2$. Also, in analogy with Proposition 3, consistency of the ratio of posterior modes can be
established.
7. Conclusions



There are several measures of statistical evidence in use. Among them are the Fisherian p-value and its extensions,
likelihood-based measures, such as the ratio of likelihoods and the extended ratio of likelihoods, as well as the
Bayes factor and the posterior odds. What are the criteria that a measure of evidence should satisfy? Coherence (cf.
Sect. 1) is one such criterion. It is a logical criterion. In this note, the asymptotic criterion of consistency was
introduced. Besides being incoherent, the p-value is also inconsistent. The ratio of likelihoods and its extension
are consistent and coherent measures. Among the Bayesian measures of evidence, the posterior odds ratio, for
instance, is both coherent and consistent.

Notes

a Likelihood ratio is used in Neyman-Pearson hypothesis testing. To distinguish the evidential use of the likelihood
ratio from its use in decision making, the former is referred to as the Ratio of Likelihoods (RL). RL has a rich
history, cf. ((J, fj]], @|, @], fTTIl, [12], [14], [15], among others.

b See also Sect. 3.2 in Edwards' monograph [5], and a recent work [3] of Bickel.

c In [17], Sellke, Bayarri, and Berger use a Monte Carlo simulation to estimate the probability
$\Pr(\Theta_1 \mid \pi(\neg H_1, \cdot, X_1^n) \approx 0.05)$, for a point set $\Theta_1$, in small samples, for the
p-value, and relate it to the analogous probability for the Bayes factor, which in the studied setting is the same as
the ratio of likelihoods. The authors do not propose an asymptotic criterion for a measure of evidence.

d Proposition 1 also holds for a p-value that is valid in the sense of Mudholkar and Chaubey [13].

to mar, in memoriam 